Kennebec County
Unsupervised decoding of encoded reasoning using language model interpretability
As large language models become increasingly capable, there is growing concern that they may develop reasoning processes that are encoded or hidden from human oversight. To investigate whether current interpretability techniques can penetrate such encoded reasoning, we construct a controlled testbed by fine-tuning a reasoning model (DeepSeek-R1-Distill-Llama-70B) to perform chain-of-thought reasoning in ROT-13 encryption while maintaining intelligible English outputs. We evaluate mechanistic interpretability methods--in particular, logit lens analysis--on their ability to decode the model's hidden reasoning process using only internal activations. We show that logit lens can effectively translate encoded reasoning, with accuracy peaking in intermediate-to-late layers. Finally, we develop a fully unsupervised decoding pipeline that combines logit lens with automated paraphrasing, achieving substantial accuracy in reconstructing complete reasoning transcripts from internal model representations. These findings suggest that current mechanistic interpretability techniques may be more robust to simple forms of encoded reasoning than previously understood. Our work provides an initial framework for evaluating interpretability methods against models that reason in non-human-readable formats, contributing to the broader challenge of maintaining oversight over increasingly capable AI systems.
- North America > United States > Illinois > Sangamon County > Springfield (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.07)
- North America > United States > California > Sacramento County > Sacramento (0.05)
- (22 more...)
Computational Analysis of Conversation Dynamics through Participant Responsivity
Hughes, Margaret, Roy, Brandon, Poole-Dayan, Elinor, Roy, Deb, Kabbara, Jad
Growing literature explores toxicity and polarization in discourse, with comparatively less work on characterizing what makes dialogue prosocial and constructive. We explore conversational discourse and investigate a method for characterizing its quality built upon the notion of ``responsivity'' -- whether one person's conversational turn is responding to a preceding turn. We develop and evaluate methods for quantifying responsivity -- first through semantic similarity of speaker turns, and second by leveraging state-of-the-art large language models (LLMs) to identify the relation between two speaker turns. We evaluate both methods against a ground truth set of human-annotated conversations. Furthermore, selecting the better performing LLM-based approach, we characterize the nature of the response -- whether it responded to that preceding turn in a substantive way or not. We view these responsivity links as a fundamental aspect of dialogue but note that conversations can exhibit significantly different responsivity structures. Accordingly, we then develop conversation-level derived metrics to address various aspects of conversational discourse. We use these derived metrics to explore other conversations and show that they support meaningful characterizations and differentiations across a diverse collection of conversations.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > Middle East > Republic of Türkiye > Konya Province > Konya (0.04)
- North America > United States > Texas (0.04)
- (13 more...)
- Government (0.93)
- Health & Medicine (0.67)
Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
Kodali, Prashant, Shivkumar, Vaishnavi, Joshi, Swarang, Choudhary, Monojit, Kumaraguru, Ponnurangam, Shrivastava, Manish
We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach for sentence classification (sentiment and hate speech) task in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2--5 points in F1 over full fine-tuning and ~1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Asia > India > Maharashtra > Pune (0.05)
- (14 more...)
EDUMATH: Generating Standards-aligned Educational Math Word Problems
Christ, Bryan R., Molitz, Penelope, Kropko, Jonathan, Hartvigsen, Thomas
Math word problems (MWPs) are critical K-12 educational tools, and customizing them to students' interests and ability levels can increase learning outcomes. However, teachers struggle to find time to customize MWPs for each student given large class sizes and increasing burnout. We propose that LLMs can support math education by generating MWPs customized to student interests and math education standards. To this end, we use a joint human expert-LLM judge approach to evaluate over 11,000 MWPs generated by open and closed LLMs and develop the first teacher-annotated dataset for standards-aligned educational MWP generation. We show the value of our data by using it to train a 12B open model that matches the performance of larger and more capable open models. We also use our teacher-annotated data to train a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training. Next, we show our models' MWPs are more similar to human-written MWPs than those from existing models. We conclude by conducting the first study of customized LLM-generated MWPs with grade school students, finding they perform similarly on our models' MWPs relative to human-written MWPs but consistently prefer our customized MWPs.
- North America > United States > Virginia (0.41)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > Texas > Travis County > Austin (0.04)
- (6 more...)
- Leisure & Entertainment > Sports (1.00)
- Education > Educational Setting > K-12 Education (0.86)
- Leisure & Entertainment > Games > Computer Games (0.67)
What Has Been Lost with Synthetic Evaluation?
Gill, Alexander, Ravichander, Abhilasha, Marasović, Ana
Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Sweden (0.14)
- Europe > Denmark (0.14)
- (33 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Leisure & Entertainment > Sports > Football (1.00)
- Government > Regional Government > Europe Government > United Kingdom Government (1.00)
- Education (1.00)
Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research
Petersen, Molly R, Stevenson, Claire E, van der Plas, Lonneke
Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Asia > Singapore (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (25 more...)
- Health & Medicine (1.00)
- Education (1.00)
- Law (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Analogical Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
The QCET Taxonomy of Standard Quality Criterion Names and Definitions for the Evaluation of NLP Systems
Belz, Anya, Mille, Simon, Thomson, Craig
Prior work has shown that two NLP evaluation experiments that report results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality, and the comparability implied by the name can be misleading. Not knowing when two evaluations are comparable in this sense means we currently lack the ability to draw reliable conclusions about system quality on the basis of multiple, independently conducted evaluations. This in turn hampers the ability of the field to progress scientifically as a whole, a pervasive issue in NLP since its beginning (Sparck Jones, 1981). It is hard to see how the issue of unclear comparability can be fully addressed other than by the creation of a standard set of quality criterion names and definitions that the several hundred quality criterion names actually in use in the field can be mapped to, and grounded in. Taking a strictly descriptive approach, the QCET Quality Criteria for Evaluation Taxonomy derives a standard set of quality criterion names and definitions from three surveys of evaluations reported in NLP, and structures them into a hierarchy where each parent node captures common aspects of its child nodes. We present QCET and the resources it consists of, and discuss its three main uses in (i) establishing comparability of existing evaluations, (ii) guiding the design of new evaluations, and (iii) assessing regulatory compliance.
- Research Report (1.00)
- Overview (0.67)
- Law (1.00)
- Government (1.00)
- Leisure & Entertainment (0.92)
- Health & Medicine > Therapeutic Area (0.46)
Findings of the Third Automatic Minuting (AutoMin) Challenge
Shinde, Kartik, Besacier, Laurent, Bojar, Ondrej, Thonet, Thibaut, Ghosal, Tirthankar
This paper presents the third edition of AutoMin, a shared task on automatic meeting summarization into minutes. In 2025, AutoMin featured the main task of minuting, the creation of structured meeting minutes, as well as a new task: question answering (QA) based on meeting transcripts. The minuting task covered two languages, English and Czech, and two domains: project meetings and European Parliament sessions. The QA task focused solely on project meetings and was available in two settings: monolingual QA in English, and cross-lingual QA, where questions were asked and answered in Czech based on English meetings. Participation in 2025 was more limited compared to previous years, with only one team joining the minuting task and two teams participating in QA. However, as organizers, we included multiple baseline systems to enable a comprehensive evaluation of current (2025) large language models (LLMs) on both tasks.
- Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
- North America > United States > Tennessee > Anderson County > Oak Ridge (0.04)
- (4 more...)
NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization
Summarizing long-form narratives--such as books, movies, and TV scripts--requires capturing intricate plotlines, character interactions, and thematic coherence, a task that remains challenging for existing LLMs. We introduce NexusSum, a multi-agent LLM framework for narrative summarization that processes long-form text through a structured, sequential pipeline--without requiring fine-tuning. Our approach introduces two key innovations: (1) Dialogue-to-Description Transformation: A narrative-specific preprocessing method that standardizes character dialogue and descriptive text into a unified format, improving coherence. (2) Hierarchical Multi-LLM Summarization: A structured summarization pipeline that optimizes chunk processing and controls output length for accurate, high-quality summaries. Our method establishes a new state-of-the-art in narrative summarization, achieving up to a 30.0% improvement in BERTScore (F1) across books, movies, and TV scripts. These results demonstrate the effectiveness of multi-agent LLMs in handling long-form content, offering a scalable approach for structured summarization in diverse storytelling domains.
- Europe > Austria > Vienna (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (16 more...)
- Media (0.92)
- Law Enforcement & Public Safety (0.67)
- Government > Military (0.67)
- Leisure & Entertainment > Social Events (0.67)
Data extraction and processing methods to aid the study of driving behaviors at intersections in naturalistic driving
Pundlik, Shrinivas, Choe, Seonggyu, Baker, Patrick, Lee, Chen-Yuan, Al-Madi, Naser, Bowers, Alex R., Luo, Gang
Naturalistic driving studies use devices in participants' own vehicles to record daily driving over many months. Due to diverse and extensive amounts of data recorded, automated processing is necessary. This report describes methods to extract and characterize driver head scans at intersections from data collected from an in-car recording system that logged vehicle speed, GPS location, scene videos, and cabin videos. Custom tools were developed to mark the intersections, synchronize location and video data, and clip the cabin and scene videos for +/-100 meters from the intersection location. A custom-developed head pose detection AI model for wide angle head turns was run on the cabin videos to estimate the driver head pose, from which head scans >20 deg were computed in the horizontal direction. The scene videos were processed using a YOLO object detection model to detect traffic lights, stop signs, pedestrians, and other vehicles on the road. Turning maneuvers were independently detected using vehicle self-motion patterns. Stop lines on the road surface were detected using changing intensity patterns over time as the vehicle moved. The information obtained from processing the scene videos, along with the speed data was used in a rule-based algorithm to infer the intersection type, maneuver, and bounds. We processed 190 intersections from 3 vehicles driven in cities and suburban areas from Massachusetts and California. The automated video processing algorithm correctly detected intersection signage and maneuvers in 100% and 94% of instances, respectively. The median [IQR] error in detecting vehicle entry into the intersection was 1.1[0.4-4.9] meters and 0.2[0.1-0.54] seconds. The median overlap between ground truth and estimated intersection bounds was 0.88[0.82-0.93].
- North America > United States > California (0.24)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Minnesota > Hennepin County > Bloomington (0.04)
- (3 more...)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.46)
- Transportation > Ground > Road (1.00)
- Health & Medicine (1.00)